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poor scaling behavior somewhat dampened this enthusiasm, 
and this behavior was traced to the use of soft constraints. The 
next generation of networks, using hard constraints, seemed to 
scale well (Peterson, 1990). Moreover, it also seems possible for 
algorithms with soft constraints to scale well, provided atten- 
tion is paid to the values of the coefficients in the bias terms 
(Aiyer et al., 1990). 

Neural network optimization has developed into a good 
general-purpose heuristic algorithm that is competitive with 
other optimization procedures. On weU-studied problems like the 
TSP, however, it still remains inferior to the best computer 
science algorithms. 

There is some difficulty, however, in comparing neural net- 
work algorithms fairly to existing computer algorithms. The 
neural network algorithms are designed to work on special- 
purpose analog hardware, and they suffer when these equa- 
tions are discretized for serial computation. A new generation 
of neural network algorithms — discrete in time, though paral- 
lel and using analog variables (Waugh and Westervelt, 1993)— 
may be more directly comparable. 

Finally, recent theoretical analysis (Yuille and Kosowsky, 
1994) shows some precise relations between neural network 
optimization and more traditional optimization methods using 
barrier functions and interior point methods (Gill et al., 1986). 



Appendix: Desirable Background Knowledge 

The mean-field equations can be obtained in at least three differ- 
ent ways (see Reif, 1982, and Hertz et al., 1991): 

1. The saddle-point approximation. This is a standard technique in 
statistical mechanics. The partition function Z = XV e"'* 1 * 1 is 
transformed into an integral form Z = J [dS] [dU]e' tE ^ s - v - n , 
where the S variables correspond to the mean fields of the V 
and the U are auxiliary variables which impose constraints 
on the S. E^IS, U:0]is extremized with respect to S and U; 
then the mean fields are approximated by the resulting values 
S*. 

2. Approximating the free energy. The free energy F= -TlogZ. 
From thermodynamic relations, it can be expressed as U - 
TS, where U is the energy and S is the entropy. The expecta- 
tion of the free energy, with respect to the Gibbs distribu- 
tion, is approximated to give the effective energy. 

3. Mean-field techniques. When evaluating the partition func- 
tion, one evaluates the contribution from each variable by 
fixing the remaining variables at their mean values. This 
leads to consistency conditions for the mean fields, the mean- 
field equations. 



Note that physicists have established conditions under which 
these approximations are valid. Unfortunately, these situations 
typically involve making the system infinitely large and impos- 
ing undesirable conditions on the form of the energy function. 
For our purposes, however, we are interested only in the weaker 
result that these approximations are valid at zero temperature. 
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Convolutional Networks for Images, Speech, and Time Series 

Yann LeCun and Yoshua Bengio 



Introduction 

The ability of multilayer backpropagation networks to learn 
complex, high-dimensional, nonlinear mappings from large col- 
lections of examples makes them obvious candidates for image 
recognition or speech recognition tasks (see Pattern Recogni- 
ta)- A typical approach is to feed the network with "raw" 



inputs (e.g., normalized images) and to rely on backpropaga- 
tion to turn the first few layers into an appropriate feature 
extractor. While this can be done with an ordinary fully con- 
nected feedforward network with some success for tasks such 
as character recognition, there are problems. 

First, typical images, or spectral representations of spoken 
words, are large, often with several hundred variables. A fully 
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connected first layer with, say, a few hundred hidden units, 
would already contain tens of thousands of weights. Oyerfitting 
problems may occur if training data are scarce. In addition, the 
memory requirement for that many weights may rule out cer- 
tain hardware implementations. But the main deficiency of un- 
structured nets for image or speech aplications is that they have 
no built-in invariance with respect to translations, or local dis- 
tortions of the inputs. Before being sent to the fixed-size input 
layer of a neural net, character images, spoken word spectra, or 
other 2D or ID signals must be approximately size normalized 
and centered in the input field. Unfortunately, no such pre- 
processing can be perfect: handwriting is often normalized at 
the word level, which can cause size, slant, and position varia- 
tions for individual characters; words can be spoken at varying 
speed, pitch, and intonation. This causes variations in the posi- 
tion of distinctive features in input objects. In principle, a fully 
connected network of sufficient size could learn to produce out- 
puts that are invariant with respect to such variations. How- 
ever, learning such a task would probably result in multiple 
units with identical weight patterns positioned at various loca- 
tions in the input. Learning these weight configurations requires 
a very large number of training instances to cover the space of 
possible variations. Conversely, in convolutional networks (which 
will be defined in the next section), shift invariance is automati- 
cally obtained by forcing the replication of weight configura- 
tions across space. 

Second, a deficiency of fully connected architectures is that 
the topology of the input is entirely ignored. The input vari- 
ables can be presented in any (fixed) order without affecting the 
outcome of the training. On the contrary, images, or spectral 
representations of speech, have a strong 2D local structure, and 
time series have a strong ID structure: variables (or pixels) that 
are spatially or temporally nearby are highly correlated. Local 
correlations are the reasons for the well-known advantages of 
extracting and combining local features before recognizing spa- 
tial or temporal objects. Convolutional networks force the ex- 
traction of local features by restricting the receptive fields of 
hidden units to be local. 



Convolutional Networks 

Convolutional networks combine three architectural ideas to 
ensure some degree of shift and distortion invariance: local 
receptive fields, shared weights (or weight replication), and, 
sometimes, spatial or temporal subsampling. A typical convo- 
lutional network for recognizing characters is shown in Figure 
1 (from LeCun et al., 1990). The input plane receives images of 
characters that are approximately size normalized and centered. 
Each unit of a layer receives inputs from a set of units located 
in a small neighborhood in the previous layer. The idea of 
connecting units to local receptive fields on the input goes back 



to the perceptron in the early 1960s, and was almost simultane- 
ous with Hubel and Wiesel's discovery of locally sensitive, 
orientation-selective neurons in the cat's visual system. Local 
connections have been reused many times in neural models of 
visual learning (see Mozer, 1991; LeCun, 1986; and Neocogni- 
tron). With local receptive fields, neurons can extract ele- 
mentary visual features such as oriented edges, endpoints, or 
corners (or similar features in speech spectrograms). These fea- 
tures are then combined by the higher layers. As stated earlier, 
distortions or shifts of -the input can cause the position of sa- 
lient features to vary. In addition, elementary feature detectors 
that are useful on one part of the image are likely to be useful 
across the entire image. This knowledge can 6e applied by forcing 
a set of units, whose receptive fields are located at different 
places on the image, to have identical weight vectors (Rumelhart, 
Hinton, and Williams, 1986). The outputs of such a set of neu- 
rons constitute a feature map. At each position, different types 
of units in different feature maps compute different types of 
features. A sequential implementation of this, for each feature 
map, would be to scan the input image with a single neuron 
that has a local receptive field and to store the states of this 
neuron at corresponding locations in the feature map. This 
operation is equivalent to a convolution with a small-size ker- 
nel, followed by a squashing function. The process can be per- 
formed in parallel by implementing the feature map as a plane 
of neurons that share a single weight vector. Units in a feature 
map are constrained to perform the same operation on different 
parts of the image. A convolutional layer is usually composed 
of several feature maps (with different weight vectors), so that 
multiple features can be extracted at each location. The first 
bidden layer in Figure 1 has four feature maps with 5 x 5 re- 
ceptive fields. Shifting the input of a convolutional layer will 
shift the output but will leave it unchanged otherwise. Once a 
feature has been detected, its exact location becomes less im- 
portant, as long as its approximate position relative to other 
features is preserved. Therefore, each convolutional layer is 
followed by an additional layer which performs a local aver- 
aging and a subsampling, reducing the resolution of the feature 
map, and reducing the sensitivity of the output to shifts and 
distortions. The second hidden layer in Figure 1 performs 2x2 
averaging and subsampling, followed by a trainable coeffi- 
cient, a trainable bias, and a sigmoid. The trainable coefficient 
and bias control the effect of the squashing nonlinearity (for 
example, if the coefficient is small, then the neuron operates in 
a quasi-linear mode). Successive layers of convolutions and sub- 
sampling are typically alternated, resulting in a bi-pyramid: at 
each layer, the number of feature maps is increased as the spa- 
tial resolution is decreased. Each unit in the third hidden layer 
in Figure 1 may have input connections from several feature 
maps in the previous layer. The convolution/subsampling com- 
bination, inspired by Hubel and Wiesel's notions of "simple" 
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and "complex" cells, was implemented in the neocognitron model 
(see Neocognitron), although no globally supervised learning 
procedure such as backpropagation was available then. 

Since all the weights are learned with backpropagation, con- 
volutional networks can be seen as synthesizing their own fea- 
ture extractor. The weight-sharing technique has the interesting 
side effect of reducing the number of free parameters, thereby 
reducing the "capacity" of the machine and improving its gen- 
eralization ability (see LeCun, 1989, on weight sharing, and 
Learning and Generalization: Theoretical Bounds for an 
explanation of generalization). The network in Figure 1 con- 
tains approximately 100,000 connections, but only approxi- 
mately 2600 free parameters because of the weight sharing. 
Such networks compare favorably with other methods on hand- 
written character recognition tasks (Bottou et a!., 1994; see also 
Handwritten Digit String Recognition), and they have been 
deployed in commercial applications. 

Fixed-size' convolutional networks that share weights along 
a single temporal dimension are known as Time-Delay Neural 
Networks (TDNNs). TDNNs have been used in phoneme rec- 
ognition (without subsampling) (Lang and Hinton, 1988; Waibel 
et al., 1989), spoken word recognition (with subsampling) (Bottou 



etal., 1990), and o 
1991). 



Variable-Size 



-line handwriting recognition (Guyon et al., 



Networks: SDNNs 



While characters or short spoken words can be size normalized 
and fed to a fixed-size network, more complex objects such as 
written or spoken words and sentences have inherently variable 
size. One way of handling such a composite object is to seg- 
ment it heuristically into simpler objects that can be recognized 
individually (e.g., characters, phonemes). However, reliable seg- 
mentation heuristics do not exist for speech or cursive hand- 
writing. A brute-force solution is to scan (or replicate) a recog- 
nizer at all possible locations across the input. While this can 
be prohibitively expensive in general, convolutional networks 
can be scanned or replicated very efficiently over large, variable- 
size input fields. Consider one instance of a convolutional net 
and its alter ego at a nearby location. Because of the convo- 



lutional nature of the networks, units in the two nets that look 
at identical locations on the input have identical outputs; there- 
fore, their output does not need to be computed twice. In effect, 
replicating a convolutional network can be done simply by in- 
creasing the size of the field over which the convolutions are 
performed, and replicating the output layer, effectively making 
it a convolutional layer. An output whose receptive field is 
centered on an elementary object will produce the class of this 
object, while an in-between output may be empty or contain 
garbage. The outputs can be interpreted as evidence for the 
categories of object centered at different positions of the input 
field. A postprocessor therefore is required to pull out con- 
sistent interpretations of the output. Hidden Markov models 
(HMMs) or other graph-based methods are often used for that 
purpose (see Speech Recognition: Pattern Matching, and 
Pattern Recognition in this volume). The replicated network 
and the HMM can be trained simultaneously by backpropa- 
gating gradients through the HMM. Globally trained, 
variable-size TDNN/HMM hybrids have been used for speech 
recognition (see Pattern Recognition for a list of references) 
and on-line handwriting recognition (Schenkel et al., 1993). 
Two-dimensional replicated convolutional networks, called Space 
Displacement Neural Networks (SDNNs) (Figure 2), have been 
used in combination with HMMs or other elastic matching 
methods for handwritten word recognition (Keeler, Rumelhart, 
and Leow, 1991; Matan et al., 1992; Bengio, LeCun, and Hen- 
derson, 1994). Another interesting application of SDNNs is 
object spotting (Wolf and Piatt, 1994). 

An important advantage of convolutional neural networks 
is the ease with which they can be implemented in hardware. 
Specialized analog/digital chips have been designed and used 
in character recognition and in image-preprocessing applica- 
tions (Boser et al., 1991). Speeds of more than 1000 characters 
per second were obtained with a network with approximately 
100,000 connections. 

The idea of subsampling can be turned around to construct 
networks that are similar to TDNNs but can generate sequences 
from labels. These networks are called reverse-TDNNs because 
they can be viewed as upside-down TDNNs: temporal resolu- 
tion increases from the input to the output, through alternated 
oversampling and convolution layers (Simard and LeCun, 1992). 



i. 



Figure 2. A variable-size replicated con- 
volutional network, called a Space Dis- 
lt Neural Network (SDNN). 
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Convolutionai neural networks are a good example of an idea 
inspired by biology that resulted in competitive engineering 
solutions that compare favorably with other methods (Bottou 
et al., 1994). Although applying convolutionai nets to image 
recognition removes the need for a separate hand-crafted fea- 
ture extractor, normalizing the images for size and orientation 
(if only approximately) is still required. Shared weights and 
subsampling bring invariance with respect to small geometric 
transformations or distortions, but fully invariant recognition 
is still beyond reach. Radically new architectural ideas, possi- 
bly suggested by biology, will be required for a fully neural 
image or speech recognition system. 



References 

Bengio, Y., LeCun, Y., and Henderson, D., 1994, Globally trained 
handwritten word recognizer using spatial representation, space dis- 
placement neural networks and hidden Markov models, in Advances 
in Neural Information Processing Systems 6 (J. Cowan, G. Tesauro, 
and J. Alspector, Eds.), San Mateo, CA: Morgan Kaufmann, 
pp. 937-944. 

Boser, B., Sackinger, E., Bromley, J., LeCun, Y., and Jackel, L„ 1991, 
An analog neural network processor with programmable topology, 
IEEE J. Solid-State Circuits, 26:2017-2025. 

Bottou, L., Cortes, C, Denker, J., Drucker, H., Guyon, I., Jackel, L., 
LeCun, Y., Muller, U., Sackinger, E., Simard, P., and Vapnik, V., 
1994, Comparison of classifier methods: A case study in handwritten 
digit recognition, in Proceedings of the International Conference on 
Pattern Recognition, Los Alamitos, CA: IEEE Computer Society 
Press. 

Bottou, L., Fogelman-Soulie, F., Blancbet, P., and Lienard, J. S., 1990, 
Speaker independent isolated digit recognition: Multilayer percep- 
trons vs dynamic time warping, Neural Netw., 3:453-465. 

Guyon, I., Albrecht, P., LeCun, Y., Denker, J. S., and Hubbard, W., 
1991, Design of a neural network character recognizer for a touch 
terminal, Pattern Recognition, 24(2):105-119. 

Keeler, J., Rumelhart, D., and Leow, W., 1991, Integrated segmenta- 
tion and recognition of hand-printed numerals, in Advances in Neu- 



ral Information Processing Systems 3 (R. P. Lippman, J. M. Moody, 
and D. S. Touretzky, Eds.), San Mateo, CA: Morgan Kaufmann, 
pp. 557-563. 

Lang, K„ and Hinton, G., 1988, The Development of the Time-Belay 
Neural Network Architecture for Speech Recognition, Technical Re- 
port CMU-CS-88-152, Carnegie-Mellon University, Pittsburgh. 

LeCun, Y., 1986, Learning processes in an asymmetric threshold net- 
work, in Disordered Systems and Biological Organization (E. Bienen- 
stock, F. Fogelman-Soulie, and G. Wcisbuch, Eds.), Les Houches, 
France: Springer-Verlag, pp. 233-240. 

LeCun, Y., 1989, Generalization and Network Design Strategies; Tech- 
nical Report CRG-TR-89-4, Department of Computer Science, 
University of Toronto. 

LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, 
W., and Jackel, L„ 1990, Handwritten digit recognition with a back- 
propagation network, in Advances in Neural Information Processing 
Systems 2 (D. Touretzky, Ed.), San Mateo, CA: Morgan Kaufmann, 
pp. 396-404. 

Matan, O., Burges, C, LeCun, Y, and Denker, J., 1992, Multi-digit 
recognition using a space displacement neural network, in Advances 
in Neural Information Processing Systems 4 (J. Moody, S. Hanson, 
and R. Lipmann, Eds.), San Mateo, CA Morgan Kaufmann, 



Mozer, M., 1991, The Perception of Multiple Objects, A Connectionist 
Approach, Cambridge, MA: MIT Press. 

Rumelhart, D., Hinton, G., and Williams, R., 1986, Learning represen- 
tations by back-propagating errors, Nature, 323:533-536, 

Schenkel, M„ Weissman, H., Guyon, I., Nohl, C, and Henderson, D., 
1993, Recognition-based segmentation of on-line hand-printed words, 
in Advances in Neural Information Processing Systems S (C. Hanson 
and L. Giles, Eds.), San Mateo, CA: Morgan Kaufmann, pp. 723- 
730. 

Simard, P., and LeCun, Y., 1992, Reverse TDNN: An architecture for 
trajectory generation, in Advances in Neural Information Processing 
Systems 4 (J. Moody, S. Hanson, and R. Lipmann, Eds.), San Mateo, 
CA: Morgan Kaufmann, pp. 579-588. 

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K., 
1989, Phoneme recognition using time-delay neural networks, IEEE 
Trans. Acoustics, Speech, Signal Processing, 37:328-339. 

Wolf, R., and Piatt, J., 1994, Postal address block location using a 
convolutionai locator network, in Advances in Neural Information 
Processing Systems 6 (J. Cowan, G. Tesauro, and J. Alspector, Eds.), 
San Mateo, CA: Morgan Kaufmann, pp. 745-752. 



Cooperative Behavior in Networks of Chaotic Elements 

Kunihiko Ktmeko 



Chaos in the Brain 

The importance of chaotic dynamics in the brain has been ap- 
preciated both theoretically and experimentally. Chaos is an 
irregular motion appearing in a. deterministic system governed, 
for example, by some differential equations or maps. The ne- 
cessity of chaotic dynamics in the brain can be inferred in the 
following contexts. So far it is the only scientific mechanism 
to connect deterministic and probabilistic views at a macro- 
scopic level. By chaotic dynamics, a tiny perturbation can be 
amplified to a macroscopic level. Second, chaos is the only 
mechanism so far to create complexity from simple rules. For 
other examples of complex behavior, such as the spin glass 
model or cellular automaton, we need a huge number of degrees 
of freedom or rules (often made from "random couplings," 
externally assigned, which also require much information). The 



necessity of chaotic dynamics for biological information pro- 
cessing has been discussed by Tsuda (1992). 

Experimentally, the existence of chaotic dynamics in the brain 
has been confirmed. Even a single neuron or a small ensemble 
of neurons can show a chaotic time series or frequency locking 
(Hayashi, Nakao, and Hirakawa, 1982). Some data of EEG 
time series show irregular motion, which is found to be chaotic 
with the aid of reconstruction of attractors or dimension esti- 
mates (Basar, 1990; Freeman, 1986). Partial synchronization of 
nonlinear oscillations has been discovered in the visual cortex 
(Gray et al., 1989; Eckhorn, 1991; see also Synchronization 
of Neuronal Responses as a Putative Binding Mechanism). 
Although the nature of their nonlinear oscillation has not yet 
been clarified, the dynamics strongly suggests the existence of 
chaos (but see Chaos in Neural Systems). Indeed, the signifi- 
cance of chaos can be deduced there; both the synchronization 
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connected first layer with, say, a few hundred hidden units, 
would already contain ten'; ol' thousands of weights. Overfilling 
problems may occur if training data arc scarce. In addition, the 
memory requirement for that many weights may rule out cer- 
tain hardware implementations. But the main deficiency of un- 
structured nets for image or speech aplications is that they have 
no built-in invarianee with respect to translations, or local dis- 
tortions of the inputs. Before being sen! to the fixed-size input 
layer of a neural net, character images, spoken word spectra, or 
other 2D or ID signals must be approximately size normalized 
and centered in the input field. Unfortunately, no such pre- 
processing can be perfect: handwriting is often normalized at 
the word level, which can cause size, slant, and position varia- 
tions for individual characters; words can be spoken at varying 
speed, pilch, and intonation. This causes variations in the posi- 
tion of distinctive features in input objects. In principle, a fully 
connected network of sufficient size could Icam to produce out- 
puts that arc invariant with respect to such variations. How- 
ever, learning such a task would probably result in multiple 
units with identical weight patterns positioned at various loca- 
tions in the input. Learning these, weight configurations requires 
a very large number of training instances to cover the space or 
possible variations Conversely, in conventional networks (which 
will be defined in the next section), shift invarianee is automati- 
cally obtained by forcing the replication of weight configura- 
tions across space 

Second, a deficiency of fully connected architectures is that 
the topology or the input is entirely ignored. The input vari- 
ables can be presented m any (.fixed) order without affecting the 
outcome of the training. On the contrary', images, 
representations of speech, have a .strong 2D local sir 
time scries have a strong ID structure: variables {or pixel: 
are spatially or temporally nearby are highly correlated, 
correlations arc the reasons for the well-known advanU 
extracting and combining local features before recognizing 
tial or temporal objects. Convolutionul networks force 
traction of local features by restricting the receptive fields < 
hidden units to be l< ca! 

Convolutions! Networks 

Convolutional networks combine three architectural ideas to 
ensure some degree of shift and distortion invarianee: local 
receptive fields, shared weights (or weight replication), and, 
sometimes, spatial or temporal subsampting. A typical convo- 
lutional network l.-r recognizing characters is shown in Figure 
1 (from LeCun et a!.. 1990). The input plane receives images of 
characters that arc approximately size normalized and centered. 
F.aeh unit of a layer receives inputs from a set of units located 
in a small iieigliborhiu..! :a the pievioas lave: The idea, ol 
connecting units to local receptive fields on the input goes back 




to the perceptron in the early 1960s, and was almost simultane- 
ous with Hubel and Wiesel's discovery of locally sensitive, 
orientation-selective neurons in the cat's visual system. Local 
connections have been reused many f ines in neural models of 
uv.ail ;e.irnmgiv:e Mozer 1991, LcCun. 1986; and NKOCOCNt- 
tron). With local receptive fields, neurons can extract ele- 
mental > visual features such as oriented edges, endpoints, or 
corners (or .similar features in speech spectrograms). These fea- 
tures are then combined by the higher layers. As stated earlier, 
distortions or shifts of the input can cause the position of sa- 
lient features to vary. In addition, elementary feature detectors 
thai are useful on one pari of the image are likely to be useful 
across the enure image This knowledge can be applied by forcing 
a set of units, whose receptive fields are located at dilfcreni 
places on the image, to have idcntit.il weight vector, (Rumelhari, 
Hintou, and Williams, 19K6) The outputs of such a set of neu- 
rons constitute a feature map. At each position, different types 
of units in different feature maps compute different types of 
features. A sequential implementation of this, for each feature 
map, would be to scan the input image with a single neuron 
lhat has a local receptive field and to store the state-, of this 
neuron at corresponding locations in the feature map. This 
operation is equivalent to a convolution with a small-size ker- 
nel, followed by a squashing function. The process can be per- 
formed in parallel by implementing the feature map as a plane 
of neurons lhat share a single weight vector. Units in a feature 
map are constrained to perform the same operation on different 
parts of ihe image. A convolulional layer is usually composed 
of several feature maps (with different weight vectors), so that 
features can be extracted at each location. The first 
layer in Figure 1 has four feature maps with 5 x 5 re- 
fields. Shifting the input of a convolutional layer will 
output but ,vi)l leave il unchanged otherwise. Once a 
has been detected, its exact location becomes less im- 
as long as its approximate position relative to other 
is preserved. Therefore, each convolutional layer is 
id by an additional layer which performs a local, aver- 
t .aid i Mihsamplmg, leducing the resolution ol the featine 
mapTand reducing the sensitivity of the output to shifts and 
distortions. The second hidden layer in Figure 1 performs 2 x l 
averaging and subsainpling, followed by a trainable coeffi- 
cient, a trainable bias, and a sigmoid. The trainable coefficient 
and bias control the effect of ihe squashing nonlinearity (for 
example if tl c coefficient is small, then the neuron operates in 
a quasi-linear mode). Successive layers of convolutions and sub- 
sampling arc typically alternated, resulting in a bi-pyrumki. at 
each layer, the number of feature maps is increased as the spa- 
tial resolution is decreased. Each unit in the tliird hidden layer 
in Figure 1 may have input connections from several feature 
naps in the previous layer, t he convolulion.'subsamphng com- 
bination, inspired by Hubel and Wiesel's notions of "simple" 
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and "complex*' cells, was implemented m the neocogniiron model 
(see Neocognitron), although no globally supervised learning 
procedure such as backpropagalion was available 'hen. 

Since all the weights are learned with backpropagalion. won- 
volutional networks can be seen as synthesizing their own fea- 
ture extractor. The weight-sharing technique has the interesting 
side effect of reducing the number ot" tree parameters, thereby 
reducing the "capacity" of the machine and improving its gen- 
eralization ability (see LcCua. l'.W). or. wcient sharing, and 

Ll.AKNINU AM; Gfc.NhRALI/AIIO.v Till OK! ft. At. BOI'SM lor an 

explanation of generalization). The network in Figure 1 con- 
tain) approximately IM).i'M>i connections but only approxi- 
mately 2600 free parameters because of the weight sharing. 
Such networks compare favorably with other methods on hand- 
written character recognition txsKs i Baton el ..I , I •'<'!.;, sic also 
H ANDWRrn en Diorr String Recugni hon), and they have been 

deployed in commercial applied v 

Fixed-size convoluiional networks that share weights along 
;i single lempoial dimension ate known as Time-Delay Neural 
Networks (TDNNs!. TDNNs have been used in phoneme rec- 
ognition (without subsampling) (Lang and Hinlon. 1988; Waibel 
et al., 1989), spoken word iceogniuon (with subsampling) (Bottou 
et al., 1990), and on-line handwriting recognition (Cuvon el al., 
1991), 



Variable-Size Convolutional Networks: SDNNs 

While characters or short spoken words can be size normalized 
and fed to a fixed-size network, more complex objects such as 
written or spoken words and sentences have inherently variable 
size. One way of handling such a composite object is to seg- 
ment it heuristically into simpler objects that can be recognized 
individually (e.g., charade is. p)ioi;e:ne>l Houcei. icIiaMc s.-g- 
mentation heuristics do not exist for speech or cursive hand- 
writing. A brute-force solution is to scan (or replicate) a recog- 
nizer at all possible locations across the input. While this can 
be prohibitively expensive in general, convolutional networks 
can be scanned or icplicaied very efficiently over large, variable- 
size input fields. Consider one instance of a convolutional net 
and its alter ego at a nearby location Because of the convo- 



luiiona: nature of the networks, units in the two nets that look 
at identical locations on the input have identical outputs: there- 
fore, (heir output docs noi need to be computed fa ice. In effect, 
replicating a convolutional network can be done simply by in- 
creasing '.lie size of liie field over which the convolutions are 
performed, and replicating the output layer, effectively making 
it a eorr.oiutionu! layer An output whose receptive field is 
centered on an elementary object will produce the class of this 
object, while an in-between output may be empty or contain 
garbage. The outputs can be inteipicted as evidence for the 
categoric* ot 'object centered at different positions of the input 
field. A postprocessor thetefore is required to pull out con- 
sistent interpretations of ihe output. Hidden Markov models 
(HMMs) or other graph-based methods are often used for that 
purpose (see Svucn Recognition: Pattern Matching, and 
Pattern Recognition in this volume). The replicated network 
and the HMM can be trained simultaneously by backpropa- 
gaiine gradients through the flMM. Globallv trained, 
variable-size TDNN/HMM hybrids have been tiscd'for speech 
recognition I see Pa iters Kit oGSittos foi a hsi of references) 
and on-line handwriting recognition (Schenkcl et al . 1993). 
!\ui-diiivr;s;oria] replicated eonxolutional ne".w,rk\ called .X'/ii/iv 
Displacement Neural Networks I SDNNs) I figure 2), have been 
used in combination with HMMs or other elastic matching 
methods for handwritten word recognition (Keeler, Rumelharl, 
and Leow. 1991; Malan et al.. 1992; Bengio, LeCun, and Hen- 
derson. 1994). Another interesting application of SDNNs is 
object spotting i. Wolf and Piatt, 1994). 

An mipoiianl advantage of convolutional neural networks 
is the case with which they can be implemented in hardware. 
Spe-:a'.i/'.o analog digital chips have been designed and used 
in character recognition and in image-preprocessing applica- 
tions (Boser et a!., 1991). Speeds of more than 1000 characters 
per second were obtained with a network with approximately 
100.000 connections. 

'l"hc idea of subsampling can be turned around to construct 
networks that arc similar lo TDNNs but can generate sequences 
from labels. These networks are called reversc-TDNNs because 
iliey can be viewed as upside-down TDNNs: temporal resolu- 
tion increases from the input lo the output, through alternated 
ovcrsamphng and convolution layers (Simard and LeCun, 1992). 
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